Abstract

A method is presented for using connectionist networks of simple computing elements to discover a particular type of constraint in multidimensional data. Suppose that some data source provides samples consisting of n-dimensional feature-vectors, but that this data all happens to lie on an m-dimensional surface embedded in the n-dimensional feature space. Then occurrences of data can be more concisely described by specifying an m-dimensional location on the embedded surface than by reciting all n components of the feature vector. The recoding of data in such a way is known as dimensionality-reduction. This paper describes a method for performing dimensionality-reduction in a wide class of situations for which an assumption of linearity need not be made about the underlying constraint surface. The method takes advantage of self-organizing properties of connectionist networks of simple computing elements. We present a scheme for representing the values of continuous (scalar) variables in subsets of units. The backpropagation weight updating method for training connectionist networks is extended by the use of auxiliary pressure in order to coax hidden units into the prescribed representation for scalar-valued variables.
1  Introduction

A common objective in Pattern Recognition, Artificial Intelligence, and Machine Vision is to discover good representations in which to describe data. An important step in generating a good representation is to expose constraint and remove redundancy in the description of sensory data. One particular form of constraint that can occur over a collection of data is illustrated in figure 1. Here, some data source generates (x, y) data points. Evidently, the source operates under some constraint because all data points appear to lie on a one-dimensional curve. If one possesses knowledge of this curve then one may describe any given data sample using only a single number, the location along the curve, instead of the two numbers required to spell out the (x, y) coordinates.

This form of data recoding is known as dimensionality-reduction [Krishnaiah and Kanal, 1982; Kohonen, 1984]. In general, the problem is to define a computational mapping between locations in an n-dimensional feature space, and locations on an m-dimensional constraint surface embedded in the n-dimensional feature space, given a set of samples drawn from the constraint surface. The purpose served by dimensionality-reduction is abstraction and simplification in the description of data. In general, a dimensionality-reduced representation can be expected to make explicit descriptive parameters capturing the natural degrees of variability inherent to a class of data, while it factors out redundancy latent among the original measured features.

Figure 1. Two-dimensional data samples constrained to lie on a one-dimensional constraint surface. a. Through dimensionality-reduction, the parameter, h, describes data in terms of location on the constraint surface. b. Data on (closed circles) and off (open circles) the constraint surface appears identical under dimensionality-reduction by pure linear transformation.

Most previous approaches to dimensionality-reduction in pattern recognition assume a linear constraint surface. Then, the techniques of factor analysis, principal components analysis, or equivalently, the Karhunen-Loève expansion [Watanabe, 1966; Tou and Heydorn, 1967; Fukunaga and Koontz, 1970; Kittler and Young, 1973], amount to first, rotating the n-dimensional coordinate system to align with the data, and second, picking out the few dimensions that best account for the variance. Linear dimensionality-reduction methods buy simplicity at a sacrifice of generality. Clearly, they would not work for the constraint surface pictured in figure 1 because this constraint surface doubles back on itself in the embedding space. Furthermore, dimensionality-reduction based on a linear approximation to a constraint surface cannot accurately determine whether or not a given, unknown, data sample lies on the constraint surface, as shown in figure 1b.
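The limitation noted for figure 1b can be illustrated with a minimal sketch. The curve `y = 4 (x - 0.5)^2`, the projection axis, and the two sample points are hypothetical choices made only for this illustration; they are not taken from the paper's figure.

```python
def project(point, direction):
    """Reduce a 2-D point to one number by projecting it onto a unit direction."""
    return point[0] * direction[0] + point[1] * direction[1]


def on_curve(point, tol=1e-9):
    """Test membership in a hypothetical curved constraint, y = 4 * (x - 0.5)**2."""
    x, y = point
    return abs(y - 4.0 * (x - 0.5) ** 2) < tol


direction = (1.0, 0.0)   # assumed axis kept by a linear reduction
p_on = (0.25, 0.25)      # satisfies the constraint
p_off = (0.25, 0.9)      # violates the constraint

# Both points receive the same one-dimensional code, so the linear code
# alone cannot tell the on-curve sample from the off-curve one.
same_code = project(p_on, direction) == project(p_off, direction)
```

A purely linear reduction discards the component orthogonal to the retained axis, which is exactly the component that separates the two samples here.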

This paper presents a more general method for achieving dimensionality-reduction which allows the underlying constraint surface to curve to a considerable degree. The method relies on the self-organizing properties of connectionist networks of simple computing elements [Rumelhart, McClelland, et al, 1986]. The technique differs from that of Kohonen [1984], with which it is compared in the discussion section. An earlier version of this work appears in [Saund, 1986].

2  Black-Box Dimensionality-Reducer

The “black box” depiction of a dimensionality-reducer is presented in figure 2. Each such box contains knowledge of one constraint surface in the feature space. At the bottom of the box enters the description of a data point in terms of an n-dimensional feature vector. Out the top emerge n lines, and out the side, m more, where m is the dimensionality of the constraint surface. Each line can represent a bounded real number; for convenience suppose that the feature coordinates are normalized so that all features take values between 0 and 1.

Figure 2. Black-box dimensionality-reducer.

The dimensionality-reducer operates as follows. If the numbers coming out the top of the black box match those coming in the bottom, then it is determined that the data point whose feature vector is given does in fact lie on the constraint surface, and its location on the constraint surface may be read on the m lines coming out the side (the dimensionality-reducer implicitly creates a coordinate system for the constraint surface). If the numbers coming out the top do not match the input feature vector, then it is determined that the data point specified at the input does not lie on the constraint surface. In this way a dimensionality-reducer carries out a form of pattern matching between input data and a parameterized “template” defined by the locus of points on the constraint surface.
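The matching rule just described can be sketched with a toy reducer whose constraint surface is known in closed form. The quarter-circle curve, the function names, and the tolerance below are assumptions made for illustration; the reducer described in this paper learns its surface from samples instead.

```python
import math


def decode(h):
    """Side lines to top lines: location h in [0, 1] on a hypothetical
    quarter-circle constraint curve maps to (x, y) feature coordinates."""
    angle = h * math.pi / 2.0
    return (math.cos(angle), math.sin(angle))


def encode(point):
    """Bottom lines to side lines: the nearest curve location for a feature vector."""
    angle = math.atan2(point[1], point[0])
    return min(1.0, max(0.0, angle / (math.pi / 2.0)))


def reduce_point(point, tol=1e-6):
    """Return (on_surface, h, reconstruction) for an input feature vector.

    The point is judged to lie on the constraint surface only when the
    reconstruction at the top matches the input at the bottom.
    """
    h = encode(point)
    recon = decode(h)
    err = max(abs(recon[0] - point[0]), abs(recon[1] - point[1]))
    return err <= tol, h, recon
```

For example, `reduce_point((0.5, 0.5))` reports a location of 0.5 on the side lines but flags the point as off the surface, because the reconstruction differs from the input.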

This black-box dimensionality-reducer may also be used in the opposite direction. That is, an m-dimensional vector specifying a location on the constraint surface may be placed on the side lines as input, and the dimensionality-reducer will then compute, and output at the top, the coordinates of this data point in the n-dimensional feature space.

The contents of the black box dimensionality-reducer described in this paper are a network of simple computing elements.


3  Connectionist Networks

Recently, renewed attention has been directed to “connectionist” networks of simple computing elements. It is believed that new developments in training procedures applied to multilayer networks will overcome some of the well-known objections to classical perceptrons [Rosenblatt, 1962; Minsky and Papert, 1969]. Of interest here are demonstrations by Hinton, et al [1984], and Rumelhart, et al [1985, 1986], that hidden units in multilayer networks are able to attain abstract representations capturing constraint in binary input data.

A prototypical example is the “encoder problem” (see figure 3). Here, the activity, $a_k$, in a layer of eight output units is calculated from the activity, $a_j$, in the middle layer of three units:

$$a_k = f(s_k) = f\Big(\sum_j w_{jk}\, a_j + b_k\Big), \qquad (1)$$

where $w_{jk}$ is the connection weight between the jth middle layer unit and the kth output layer unit. $b_k$ is a bias input; it is omitted from succeeding equations because it can be considered equivalent to the weight, w, from a unit whose output is always 1. The activity in the middle layer is calculated from the activity in the input layer in a similar way. The middle layer units are called “hidden” units because their activity is not directly accessible either at network input or output. Each unit’s activity is bounded between 0 and 1 by $f$, which is typically a sigmoidal nonlinear function, for example,

$$f(x) = \frac{1}{1 + e^{-x}}. \qquad (2)$$
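Equations (1) and (2) can be sketched in a few lines. The function names and the `weights[j][k]` layout are assumptions chosen for this illustration, and the logistic form is used as the example sigmoid.

```python
import math


def f(x):
    """Logistic sigmoid of equation (2); output is bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_activity(inputs, weights, biases):
    """Equation (1): a_k = f(sum_j w_jk * a_j + b_k).

    weights[j][k] connects unit j of the lower layer to unit k of the
    upper layer; biases[k] is the bias input of upper-layer unit k.
    """
    n_in = len(inputs)
    return [
        f(sum(weights[j][k] * inputs[j] for j in range(n_in)) + biases[k])
        for k in range(len(biases))
    ]
```

Calling `layer_activity` once for the input-to-hidden weights and once for the hidden-to-output weights gives the full forward pass of a three-layer network.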

The goal of the encoder problem is to set the weights such that if any single unit in the input layer is set ON (near 1), and the rest set OFF (near 0), the activity will propagate so that only the corresponding output unit will be ON. Because data presented to the input layer is constrained, so that only a subset of all possible patterns of activity in the input layer ever actually appear, the information as to which unit is ON may be propagated to the output layer through a middle layer containing fewer than eight units. In particular, the three middle layer units may transmit the information by adopting a binary code. Rumelhart, Hinton, and their colleagues have described a method, called backpropagation, for training a network to achieve such behavior without directly specifying the behavior of the hidden layer: repeatedly present the network with sample input and allow activity to propagate to the output layer, observe the error between the desired output and the observed activity at the output layer, and use these errors to update the network weights. This technique is described in more detail below.

Figure 3. a. Three layer connectionist network. Activity at the input layer drives hidden layer, activity at hidden layer drives output layer. b. Activity at a unit is computed as a semilinear function of the weighted sum of the unit’s inputs. c. Activity in an 8-3-8 network trained for the encoder problem. Input is constrained so that only one input unit is ON at a time. Activity at output matches input. The information as to which input is ON is able to be transmitted via the hidden unit layer of only three units. Size of circle represents unit’s activity.

Here we describe a method for extending the backpropagation weight updating procedure for connectionist networks in order to perform dimensionality-reduction over continuous-valued data.

4  Representing Scalar Variables

We  must  start  by  endowing  networks  with  the  ability  to  represent  not  just 
the  binary  variables,  ON  and  OFF,  but  variables  continuous  over  some 
interval.  For  convenience  let  this  interval  be  (0,1). 

One  conceivable  way  of  representing  scalars  in  simple  units  is  via  a 
unit’s  activity  level  itself.  Since  only  one  weight  connects  any  middle- 
layer  unit  to  any  given  output  unit,  this  strategy  is  clearly  inadequate 
for  representing  anything  but  linear  relationships  between  variables.  The 
relationship  between  x  and  y  in  figure  1  is  not  linear,  so  the  relationship 
between  x  and  some  hidden  variable,  h,  and  between  y  and  h  must  not 
both  be  linear.  Therefore,  we  instead  use  the  following  scheme. 

Each scalar feature value is represented as the pattern of activity over a set of units, called a scalar-set, as illustrated in figure 4. The pattern of activity is determined by sampling a unimodal smearing function, $S_\omega$, centered at the location along the (0, 1) interval corresponding to the intended scalar value. The exact form of the smearing function is not important; here, it happens to be the derivative of the function, $f$, of equation (2), but other forms, such as the Gaussian, may be used. The parameter, $\omega$, controls the width of the smearing effect. This method for encoding scalar values achieves a form of interpolation. Resolution of better than 1 part in 50 may be achieved using as few as eight units.

Figure 4. a. Scalar values between 0 and 1 are represented in scalar-sets of units whose activity takes a characteristic pattern defined by sampling a unimodal smearing function, $S_\omega$. The activity pattern shown in this 12-unit scalar-set represents the number, .4. b. $S_\omega$ for several values of the smearing width parameter, $\omega$. c. Scalarized activity in a two-dimensional scalar-set.

Any set of units whose activity pattern displays the characteristic unimodal form for encoding scalar values is said to exhibit “scalarized” behavior. The number expressed in a scalarized pattern of activity may be decoded as the placement of the smearing function, $S_\omega$, at the location, x, within the (0, 1) interval, that minimizes the least-square difference,

$$\sum_i \big[S_\omega(x - x_i) - a_i\big]^2, \qquad (3)$$

where $x_i$ denotes the location along the interval assigned to the ith unit of the scalar-set and $a_i$ is that unit's activity.
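A minimal sketch of scalar-set encoding and least-squares decoding follows. Several details are assumptions made for illustration: units are spaced evenly on [0, 1], the smearing function is the logistic derivative rescaled to peak at 1, the width is 0.1, and decoding uses a simple grid search rather than any particular optimizer.

```python
import math


def smear(d, width):
    """Unimodal smearing function: derivative of the logistic sigmoid,
    rescaled (an assumption) so that its peak value is 1 at d = 0."""
    s = 1.0 / (1.0 + math.exp(-d / width))
    return 4.0 * s * (1.0 - s)


def unit_positions(n_units):
    """Assumed layout: units evenly spaced over the interval [0, 1]."""
    return [i / (n_units - 1) for i in range(n_units)]


def encode_scalar(x, n_units=12, width=0.1):
    """Sample the smearing function, centered at x, at every unit position."""
    return [smear(x - p, width) for p in unit_positions(n_units)]


def decode_scalar(activity, width=0.1, steps=1000):
    """Grid search for the placement x that minimizes the squared
    difference between the sampled smearing function and the activity."""
    positions = unit_positions(len(activity))
    best_x, best_cost = 0.0, float("inf")
    for s in range(steps + 1):
        x = s / steps
        cost = sum((smear(x - p, width) - a) ** 2
                   for p, a in zip(positions, activity))
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x
```

Because neighboring units respond in graded fashion, the decoded value can fall between unit positions, which is the interpolation effect mentioned above.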

Two means are available for representing multidimensional vectors in terms of scalar-sets. The most straightforward method is to provide one one-dimensional scalar-set for each scalar component of the vector. However, in order to perform dimensionality-reduction it is also necessary to employ an alternative representation in which the scalar-set itself takes the inherent dimensionality of the vector. For example, figure 4c shows a scalarized pattern of activity over a two-dimensional scalar-set.

5  Training  the  Network 

A three-layer connectionist network configured to perform dimensionality-reduction is illustrated in figure 5. A one-dimensional scalar-set is provided at the input layer and at the output layer for each coordinate dimension of the higher-dimensional feature space. The hidden layer is configured as a scalar-set whose dimensionality matches that of the embedded constraint surface. The input, hidden, and output layers of the network correspond to the bottom, side, and top of the black box dimensionality-reducer.

Dimensionality-reducing behavior is achieved by virtue of the link weights between successive layers of the network. These weights are established using the backpropagation training procedure, which furnishes crucial self-organizing properties during the training phase. Training consists of repeated presentation of input activity/desired output activity pairs, where the desired output is defined to be identical to input activity. At each training trial, activity at the input layer is fixed according to the scalarized representation for the coordinates of a data sample expressed in terms of the high-dimensional feature space. Activity propagates through the network, and the output layer activity is compared with that of the input to create an output error vector. This error is used to update link weights within the network in such a way as to reduce the output error.

Figure 5. Connectionist dimensionality-reducer. In this case, two scalar-sets are provided at the input and output layers, and the hidden layer is a one-dimensional scalar set; therefore, this network can represent a one-dimensional constraint curve in a two-dimensional feature-space. During training period, errors from desired activity are used to train network to reproduce input activity pattern at output. Auxiliary error pressures hidden layer units into adopting a scalarized representation.

The backpropagation method for updating weights in order to achieve some desired input/output behavior may be derived by expressing the relationship between a.) a cost for error in output behavior, and b.) the strengths of individual weights in the network. Following Rumelhart, et al [1986], define cost

$$E = \sum_p E_p = \frac{1}{2}\sum_p \sum_k (a_{pk} - t_{pk})^2 = \frac{1}{2}\sum_p \sum_k \delta_{pk}^2 \qquad (4)$$

as the cost over all output units, k, of error between output, $a_{pk}$, and target output, $t_{pk}$, summed over all sets of presentations, p, of input data. Weights are adjusted a slight amount at each presentation, p, so as to attempt to reduce $E_p$. The amount to adjust each weight connecting a middle layer unit and an output unit is proportional to (from (1) and (4))

$$\frac{\partial E_p}{\partial w_{jk}} = \frac{\partial E_p}{\partial a_k}\,\frac{\partial a_k}{\partial s_k}\,\frac{\partial s_k}{\partial w_{jk}} = \delta_{pk}\, f'(s_k)\, a_j. \qquad (5)$$

Take

$$\Delta w_{jk} = -\eta\, \delta_{pk}\, f'(s_k)\, a_j \qquad (6)$$

as the amount to adjust weight $w_{jk}$ at presentation p. $\eta$ is a parameter controlling learning rate.

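Adjustments of weights between the input and middle layers are proportional to

$$\frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial a_j}\,\frac{\partial a_j}{\partial s_j}\,\frac{\partial s_j}{\partial w_{ij}} \qquad (7)$$

$$= \Big(\sum_k \frac{\partial E_p}{\partial a_k}\,\frac{\partial a_k}{\partial s_k}\,\frac{\partial s_k}{\partial a_j}\Big)\frac{\partial a_j}{\partial s_j}\,\frac{\partial s_j}{\partial w_{ij}} \qquad (8)$$

$$= \Big(\sum_k \delta_{pk}\, f'(s_k)\, w_{jk}\Big) f'(s_j)\, a_i. \qquad (9)$$

Defining

$$\delta_{pj} = \sum_k \delta_{pk}\, f'(s_k)\, w_{jk}, \qquad (10)$$

we arrive at a recursive formula for propagating error in output back through the network. Take

$$\Delta w_{ij} = -\eta\, \delta_{pj}\, f'(s_j)\, a_i. \qquad (11)$$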
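The update rules of equations (6), (10), and (11) can be sketched as a single training step for a three-layer network. This is an illustrative sketch under stated assumptions: the sigmoid is logistic (so $f'(s) = a(1 - a)$), bias inputs are omitted as in the text, weights are stored as nested lists `w_ij[i][j]` and `w_jk[j][k]`, and the learning rate of 0.5 is arbitrary.

```python
import math


def f(x):
    """Logistic sigmoid, equation (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(a_in, w_ij, w_jk):
    """Propagate activity from input layer to hidden and output layers."""
    a_hid = [f(sum(w_ij[i][j] * a_in[i] for i in range(len(a_in))))
             for j in range(len(w_ij[0]))]
    a_out = [f(sum(w_jk[j][k] * a_hid[j] for j in range(len(a_hid))))
             for k in range(len(w_jk[0]))]
    return a_hid, a_out


def train_step(a_in, target, w_ij, w_jk, eta=0.5):
    """One presentation: update both weight layers in place and return
    the cost E_p measured before the update."""
    a_hid, a_out = forward(a_in, w_ij, w_jk)
    delta_k = [a - t for a, t in zip(a_out, target)]   # output error
    fp_k = [a * (1.0 - a) for a in a_out]              # f'(s_k)
    fp_j = [a * (1.0 - a) for a in a_hid]              # f'(s_j)
    # Equation (10): error propagated back to each hidden unit,
    # computed with the weights as they were before this update.
    delta_j = [sum(delta_k[k] * fp_k[k] * w_jk[j][k]
                   for k in range(len(a_out)))
               for j in range(len(a_hid))]
    for j in range(len(a_hid)):                        # equation (6)
        for k in range(len(a_out)):
            w_jk[j][k] -= eta * delta_k[k] * fp_k[k] * a_hid[j]
    for i in range(len(a_in)):                         # equation (11)
        for j in range(len(a_hid)):
            w_ij[i][j] -= eta * delta_j[j] * fp_j[j] * a_in[i]
    return 0.5 * sum(d * d for d in delta_k)
```

Repeating `train_step` with the target set equal to the input reproduces the auto-association training regime described above, on whatever small network the caller supplies.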

Essentially, this method for updating weights performs a localized gradient descent in error cost over weight space. By asserting an equivalent hidden layer error vector, $\delta_j$, the backpropagation step amounts to analyzing how each unit at the hidden layer contributes to error observed at the output layer.

In implementation, a “momentum” factor is used to smooth the trajectory in weight space. The actual weight adjustment used at iteration t is given by

$$\Delta w'(t) = \Delta w(t) + \alpha\, \Delta w'(t - 1), \qquad (12)$$

where $\alpha$ is a parameter controlling the rate of decay of the contribution of previous data samples.
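The smoothing effect of the momentum factor can be sketched as follows. The sketch assumes the recursive reading of equation (12), in which the previous actual adjustment (not the previous raw adjustment) is carried forward; the function name and the value 0.9 are illustrative only.

```python
def momentum_update(raw_step, previous_step, alpha=0.9):
    """Blend the raw adjustment for this iteration with the actual
    adjustment applied at the previous iteration."""
    return raw_step + alpha * previous_step


# With a constant raw adjustment of 1.0, the applied adjustment
# approaches 1 / (1 - alpha); earlier samples decay geometrically.
step = 0.0
for _ in range(200):
    step = momentum_update(1.0, step, alpha=0.9)
```

A larger value of alpha gives older presentations more influence and a smoother trajectory, at the price of slower response to a change in gradient direction.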

6  Auxiliary  Error  to  Constrain  Hidden  Unit  Activity 

If all of the data presentations to scalar-sets at the input layer conform to scalarized representations for the scalar components of the data vector, then after a suitable training period the output will come to faithfully mimic the input. Unfortunately, there is no guarantee that the hidden units will adopt the scalarized representation in their transmission of activity from the input layer to the output layer. In general their coding behavior will be unpredictable, depending upon the initial randomized state of the network and the order of data sample presentation.

What is needed is a way to force the scalar-set at the hidden layer into adopting the prescribed scalarized pattern of activity. For this purpose we introduce an auxiliary error term, $\gamma_j$, to be added to the error in activity at the hidden layer, $\delta_j$, which is propagated back from the error in activity at the output layer (10). The network weights connecting the input layer and the middle layer are now updated according to

$$\Delta w_{ij} = -\eta\,\big(\lambda\, \delta_j + (1 - \lambda)\, \gamma_j\big)\, f'(s_j)\, a_i, \qquad (13)$$

where $\lambda$ ($0 < \lambda < 1$) trades off the relative contributions of $\delta$ and $\gamma$. $\gamma$ must be of a suitable nature to pressure the hidden units into becoming scalarized as the training proceeds. We compute $\gamma_j$ for the units of the hidden layer scalar-set as follows.

We may view the activity over the set of units in a scalar-set as the transformation, by the smearing function, $S$, of some underlying “likelihood” distribution, p(j), over values in the interval, 0 < j < 1. The activity in a scalar-set is the convolution of the likelihood distribution with the smearing function, sampled at every unit. Scalarized activity occurs when the underlying distribution is the Dirac Delta function. The strategy we suggest for adding auxiliary pressure to the scalar-set activity is simple: to encourage scalarized behavior, add some factor to sharpen the peaks and lower the valleys of the likelihood distribution, to make it more like the Delta function. A convenient way of doing this is to raise the underlying distribution to some positive power, and normalize so that the total area is unity. In the general case, if this procedure were repeatedly applied to some distribution, one peak would eventually win out over all others. The procedure is summarized by the following equation:

$$\gamma(j) = \big(S * N\{[S^{-1} * a(j)]^{g}\}\big) - a(j). \qquad (14)$$

The activity in the scalar-set, a(j), is deconvolved with the smoothing function, $S$, to reveal the underlying likelihood distribution. This is raised to the power, g > 0, and then normalized (by $N$). This new underlying likelihood is now convolved with the smoothing function, $S$, and $\gamma$ is taken as the error between this result and the current activity in the scalar-set.

Now, on every training trial the weight updating term, $\delta_j$, pressures hidden units to adopt activities that will reduce the error between input layer activity and output layer activity, while the auxiliary error term, $\gamma_j$, pressures hidden units to adopt scalarized activity.

In an actual implementation, a(j) is not a continuous function, but rather consists of the activity in the finite, usually small, number of units in the scalar-set. Therefore the bandwidth available for conveying the underlying likelihood, p(j), is small; sharp peaks in p(j) are not representable because high spatial frequency information cannot be carried in a. An alternative expression for $\gamma$, to substitute for (14), has been found acceptable in practice:

$$\gamma_j = N\big[S * a(j)^2\big] - a(j). \qquad (15)$$

Thus, we square the activity in each unit, convolve this squared activity in the scalar-set with the smearing function, $S$, then normalize so that the total activity in the scalar-set sums to a constant. This procedure for generating $\gamma$ approximates the effect of encouraging scalarized patterns of activity in the scalar-set.
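The practical form of the auxiliary error, equation (15), and the blend inside equation (13) can be sketched as below. Several choices are assumptions for illustration only: units are evenly spaced on [0, 1], the smearing function is the rescaled logistic derivative, and the normalizing constant is taken to be the current total activity of the scalar-set.

```python
import math


def smear(d, width):
    """Smearing function: logistic derivative rescaled to peak at 1."""
    s = 1.0 / (1.0 + math.exp(-d / width))
    return 4.0 * s * (1.0 - s)


def auxiliary_error(activity, width=0.1):
    """Square each unit's activity, convolve with the smearing function
    over the unit positions, normalize, and subtract current activity."""
    n = len(activity)
    positions = [i / (n - 1) for i in range(n)]
    squared = [a * a for a in activity]
    smoothed = [
        sum(smear(positions[j] - positions[i], width) * squared[i]
            for i in range(n))
        for j in range(n)
    ]
    total = sum(smoothed)
    # Assumed normalization: keep the total activity unchanged.
    scale = sum(activity) / total if total > 0.0 else 0.0
    return [scale * v - a for v, a in zip(smoothed, activity)]


def blended_error(delta, gamma, lam=0.5):
    """Trade off propagated error and auxiliary error with weight lam."""
    return [lam * d + (1.0 - lam) * g for d, g in zip(delta, gamma)]
```

For a two-peaked activity pattern, the sharpened target favors the larger peak more strongly than the current activity does, which is the intended pressure toward a single unimodal bump.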


7  Sculpting  the  Cost  Landscape 


As noted above, the network training procedure carries out gradient descent. Weights are updated at each training presentation so as to reduce the cost, E. This cost is high when activity in the output layer differs from activity in the input layer, and, due to the auxiliary error term, $\gamma$, the cost is high when activity in the hidden layer scalar-set does not conform to the characteristic scalarized representation for scalar values. If, as is usually the case, no prior knowledge of constraint operating upon the data source is available, the network is initialized with random values for all weights, and E will be large at the outset.

Simple gradient descent methods commonly suffer from the problem that there may exist local minima in the cost landscape that are far from the global minimum. Once a network falls into a local minimum there is no escape.

The local minimum phenomenon has been reported by Rumelhart, et al [1985], in normal binary-variable connectionist training, where the only pressure to adjust weights comes from error between output and input activity. It should perhaps not then be surprising to encounter local minima in the dimensionality-reduction problem, where we impose a cost factor due to non-scalarlike behavior in hidden units, in addition to the normal cost for output activity deviation from input. In effect, what we are doing is adding two cost landscapes to one another. The weight adjustment that reduces one cost component may raise the other. Figure 6 is a simple illustration of one way in which adding two cost landscapes can create local minima.
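The effect of adding two cost landscapes can be checked numerically in one dimension. The well shape, the centers, and the depths below are hypothetical and serve only to illustrate the idea; they are not the landscapes of figure 6.

```python
import math


def well(x, center, depth=1.0):
    """A hypothetical cost landscape with a single minimum at `center`."""
    return -depth * math.exp(-(x - center) ** 2)


def count_local_minima(cost, lo=-6.0, hi=6.0, steps=1200):
    """Count interior grid points lying strictly below both neighbors."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [cost(x) for x in xs]
    return sum(1 for i in range(1, steps)
               if ys[i] < ys[i - 1] and ys[i] < ys[i + 1])


def combined(x):
    """Sum of a deep well at -2 and a shallower well at +2."""
    return well(x, -2.0, 1.0) + well(x, 2.0, 0.6)


n_single = count_local_minima(lambda x: well(x, -2.0))
n_combined = count_local_minima(combined)
```

Each well alone has one minimum, while the sum has two, and the shallower of the two is a local minimum in which plain gradient descent could become trapped.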

Two strategies have been proposed for surmounting the local minimum problem. One is simulated annealing in a Boltzmann machine [Kirkpatrick, et al, 1983; Hinton, et al, 1984]. Briefly, simulated annealing allows the training process to adjust weights probabilistically so as to increase cost. This allows the procedure to jump out of local minima in cost. Boltzmann machine learning can be slow, and it requires certain conditions on the shape of the cost landscape in order to have a good chance of working. We have not investigated its applicability to the dimensionality-reduction problem.

Another strategy for skirting local minima involves changing the shape of the cost landscape itself as training proceeds. The idea is to introduce a parameter that makes the landscape very smooth at first, so that the network may easily converge to the local (and global) minimum. Then, gradually reduce this parameter to slowly change the landscape back into the “bumpy” cost landscape whose minimum defines the network behavior actually desired. A variant on this technique has been used by Hopfield and Tank to train networks to find good (but not optimal) solutions to the traveling salesman problem (see also [Koch, et al, 1985]).

Figure 6. a. Neither of the cost landscapes shown has a local minimum by itself. b. But if they are moved near one another and added, local minima can be created in the resulting landscape.

For the dimensionality-reduction problem we take as the cost landscape smoothing parameter the parameter, $\omega$, of the smearing function, $S_\omega$. At the beginning of a training session, the activity in all scalar-sets describing scalar-valued numbers is smeared across virtually all of the units within each scalar-set. Figure 4b illustrates the activity across a scalar-set under a variety of smoothing parameters, $\omega$.

This strategy creates two related effects. First, it reduces the precision to which the data values presented as input activity, and sought by the output error term, are resolved. Thus, local kinks and details of any curve constraining the input data are blurred over more or less, depending on $\omega$. Second, under smearing with a large $\omega$, auxiliary error on the hidden layers pressures each unit’s activity to be not too different from its neighbor’s activity. The activity in hidden unit layers is thereby encouraged to organize itself into adopting the scalarized representation.

Training begins with the smearing parameter, $\omega$, set to a large value. The parameter is gradually reduced to its final, highest resolution smearing value according to a training schedule. Typically, a training protocol might involve several thousand data-sampling/weight-updating trials for each of five intermediate values for $\omega$.
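A training protocol of this kind can be sketched as a simple staged loop. The geometric spacing of the smearing widths, the start and final values, and the callback name `train_trial` are assumptions for illustration; the text specifies only that the width decreases over roughly five stages of several thousand trials each.

```python
def smearing_schedule(w_start, w_final, n_stages=5):
    """Assumed geometric sequence of smearing widths from w_start
    (coarse, smooth landscape) down to w_final (highest resolution)."""
    ratio = (w_final / w_start) ** (1.0 / (n_stages - 1))
    return [w_start * ratio ** s for s in range(n_stages)]


def run_protocol(train_trial, w_start=0.5, w_final=0.05,
                 n_stages=5, trials_per_stage=3000):
    """Call train_trial(width) repeatedly, stage by stage.

    train_trial is a caller-supplied function that samples one data
    point, encodes it with the given smearing width, and applies one
    weight update.
    """
    for width in smearing_schedule(w_start, w_final, n_stages):
        for _ in range(trials_per_stage):
            train_trial(width)
```

Because each stage starts from the weights left by the previous one, the network tracks the minimum as the landscape is gradually roughened, provided the minimum does not jump between stages.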

8  Performance 

The performance of the connectionist dimensionality-reducer on two-dimensional data constrained to lie on a one-dimensional constraint surface is illustrated in figure 7. X’s represent locations indicated by output activity computed by the network when the input is drawn from points on the constraint curve; the extent to which X’s lie on the curve simply demonstrates that network output conforms to input. Numbers shown are scalar values indicated by scalarized activity in the hidden layer scalar-set for inputs sampled from locations along the constraint curve. Due to the self-organizing character of the training procedure, these numbers progress in an orderly fashion from one end of the constraint curve to the other. Figure 7b displays network activity for one data point drawn from the constraint curve. Note that output activity matches input activity, and that the activity of the hidden layer adopts a unimodal, scalarized pattern. Note also that in this case the connectionist dimensionality-reducer captures the constraint surface even though it doubles back on itself in both the x and y dimensions.

Figure 7. a. One-dimensional constraint in two-dimensional data. X’s denote network output when input is taken from the constraint curve shown. Numbers indicate scalar value at hidden layer for points along constraint curve. b. Sample of network activity for one location along constraint curve. Note that output activity matches input activity, and that hidden layer scalar set activity takes a unimodal, scalarized, pattern.

Figure 8 illustrates a case in which a connectionist dimensionality-reducer is able to discover a constraint surface given noisy training data. During training, data samples were drawn randomly from the band pictured. Shown slightly offset from this band, numbers indicate hidden scalar-set encoding of locations along the constraint curve found by the dimensionality-reducer (indicated by X’s).

Figure 9 shows successful dimensionality-reduction given relatively sparsely sampled data. During training, data samples were drawn, at random, only from the points shown by X’s. After training, the dimensionality-reducer nonetheless correctly interprets data at all locations along the length of the curve. Empirical investigation indicates that during training the constraint curve must be sampled no more sparsely than approximately three data points per hidden layer unit.

Figure 9. Dimensionality-reduction under sparsely sampled training

Figure 10. Unsuccessful dimensionality-reduction. When the constraint
surface doubles back on itself radically, initial alignment of
the network's early rough approximation to the data is not untangled
as smearing width parameter w decreases later in the training
protocol. a. Numbers indicating best scalar-value interpretations of
hidden layer activity do not progress in an orderly fashion from one
end of the constraint curve to the other. b. Sample of network activity,
after training, for one location on the constraint curve. Hidden
layer does not display scalarized pattern of activity.

Some types of constraint surface cannot be discovered by the connectionist
dimensionality-reducer. These are curves that double back on themselves
radically. Figure 10 illustrates a case where hidden layer activity does
not display a scalarized pattern of activity representing an orderly
progression of scalar values for successive locations on the constraint curve.
The reason for this failure is that points such as p1 and p2 in figure 10
appear indistinguishable to the network early in the training procedure, when
w causes very heavy smearing of their coordinate representations. They are
therefore assigned similar encodings in the hidden unit layer. As w is
decreased, later in the training procedure, the network remains stuck in a
local minimum of trying to encode both p1 and p2 using nearby hidden
scalar values, when in fact they lie on opposite ends of the constraint
curve and so should be assigned very different encodings in
the hidden layer scalar-set. The cost landscape sculpting strategy does not
work when, as the landscape smoothing parameter is decreased, the global
minimum in cost suddenly appears in a very different location in weight
space from where the previous local minimum had been. Clearly, then, a
network cannot be proved to converge to dimensionality-reducing behavior
in the general case, which includes pathologically contorted constraint
surfaces. However, once the training procedure is completed, it is a
straightforward matter to detect whether or not dimensionality-reduction has
been achieved, simply by sampling the data source and determining whether
the network maps input activity into (nearly) identical output activity via
properly scalarized hidden layer activity.
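The post-training check just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `network` callable, the error threshold `max_error`, and the helper names are hypothetical and do not come from the paper, and the unimodality test covers only a one-dimensional hidden scalar-set.

```python
def is_unimodal(activity, tol=1e-6):
    """True if the activity rises to a single peak and then falls."""
    peak = max(range(len(activity)), key=lambda i: activity[i])
    rising = all(activity[i] <= activity[i + 1] + tol for i in range(peak))
    falling = all(activity[i] + tol >= activity[i + 1]
                  for i in range(peak, len(activity) - 1))
    return rising and falling


def reduction_achieved(network, samples, max_error=0.05):
    """Check sampled data against a trained network.

    `network(sample)` is a hypothetical callable assumed to return a pair
    (hidden_activity, output_activity) for one input activity vector.
    """
    for inp in samples:
        hidden, out = network(inp)
        # Output activity must (nearly) reproduce input activity.
        err = max(abs(a - b) for a, b in zip(inp, out))
        if err > max_error or not is_unimodal(hidden):
            return False
    return True
```

A network that reproduces its input but routes it through a bimodal hidden pattern, as in figure 10, would fail this check on the second condition.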

Prior to training a dimensionality-reducer, it is important to select a
dimensionality for the hidden layer to match the inherent dimensionality of
the surface constraining the data. The connectionist dimensionality-reducer
provides no means for doing this automatically. However, again, it is easy
to detect whether the hidden layer is of inadequate dimensionality,
because under this condition a network will converge to a state in which
it does not correctly map activity at the input layer to (nearly) identical
activity at the output layer in terms of a unimodal, scalarized pattern of
activity at the hidden layer.


The method expands readily to larger high-dimensional feature spaces
simply by adding more scalar-sets at the input and output layers to
represent additional scalar components of the feature vector. Figure 11
illustrates the n = 3, m = 2 case. Figure 11a is the true underlying constraint
surface. Figure 11b represents network output for input data drawn from
the constraint surface. Figure 11c illustrates network output when activity
corresponding to successive (h1, h2) pairs (0 < h1 < 1, 0 < h2 < 1) is
injected directly into the hidden layer.

However, the tractability of discovering many-dimensional constraint
surfaces degrades quickly as the dimensionality of the hidden layer
constraint surface increases. The amount of data that must be analyzed in
order to establish a constraint surface increases as the power of the
surface's dimensionality, and the cost in terms of network links and nodes
increases accordingly. To give an idea of computing costs, a training
protocol of 2000 trials for each of five values of the smearing parameter, w,
takes approximately 30 minutes for the n = 3, m = 1 case, with resolution
of eight units per scalar-set, on a Symbolics 3600 lacking floating point
hardware, while the n = 3, m = 2 case takes three hours.
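To make the growth in network size concrete, the following sketch counts units and links for a given n, m, and scalar-set resolution. It rests on assumptions that are not confirmed by this section: that the hidden scalar-set is a grid of k^m units, that the input and output layers each hold n scalar-sets of k units, and that successive layers are fully connected through a single hidden layer. The function name `network_size` is hypothetical.

```python
def network_size(n, m, k=8):
    """Count units and links under the assumptions stated above.

    n: dimensionality of the feature space
    m: dimensionality of the hidden-layer scalar-set
    k: resolution, in units per scalar-set dimension
    """
    io_units = n * k              # units in the input layer (same for output)
    hidden_units = k ** m         # assumed grid of k^m hidden units
    links = 2 * io_units * hidden_units  # input->hidden plus hidden->output
    return io_units, hidden_units, links


print(network_size(3, 1))  # n = 3, m = 1
print(network_size(3, 2))  # n = 3, m = 2
```

Under these assumptions, going from m = 1 to m = 2 at a resolution of eight units multiplies the hidden unit and link counts by eight, which is of the same order as the reported increase from 30 minutes to three hours.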

To illustrate the application of dimensionality-reduction to real data,
figure 12 shows a set of bananas that were originally described in terms of
six properties crudely measured on the banana shapes, such as the distance
between the ends and average curvature of various contour segments. By
training a connectionist dimensionality-reducer on these data samples, the
bananas were found to lie on a two-dimensional constraint surface in the
six-dimensional feature space. The organization of this constraint surface
is illustrated in the figure; bananas are placed on a plane according to
their respective two-dimensional coordinates. Note that banana shapes
are organized on the basis of very subtle differences in their geometrical
properties.

Although the reduced dimensionality representation concisely encodes
the essential parameters of variability among members of the data class
falling on a constraint surface, the lower-dimensional coordinate axes do
not necessarily align with interpretations of these parameters preferred by
human observers. For example, the horizontal and vertical axes of figure
12 roughly correspond to curvature of the lower part of a banana, and
banana size, respectively; however, the dimensionality-reduction training
procedure run again on the same banana data might mirror-reflect these
axes, or rotate them an arbitrary amount in the plane.

9  Discussion 

The  connectionist  dimensionality-reducer  is  able  to  capture  a  wide  class  of 
lower-dimensional  constraint  surfaces  embedded  in  high  dimensional  feature 
spaces,  even  when  the  constraint  surface  curves  to  a  considerable  degree. 
The important distinction from previous, linear transformation approaches
to  dimensionality-reduction  is  that  the  connectionist  approach  enables  the 
system  to  maintain  a  great  deal  of  knowledge  about  constraint  on  the  data 
source  reflected  in  data  samples.  This  knowledge  is  contained  in  the  weights 
between  units  in  successive  layers  of  the  network.  Note  that  nowhere  is 
the  constraint  surface  described  explicitly;  its  shape  remains  implicit  in 
the  weight  connections.  In  contrast,  only  a  few  parameters  are  available 
to  a  linear  transformation,  which  must  therefore  approximate  a  complex, 
curving  constraint  surface  by  a  linear  surface. 
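The limitation of a linear surface can be illustrated with a small, self-contained sketch. It fits the best straight line (the principal axis) to two-dimensional points and reports how far each point lies from that line. The semicircle test curve and the function name `principal_axis_residuals` are illustrative assumptions, and the principal-axis fit stands in for linear transformation approaches in general rather than for any specific method cited in the paper.

```python
import math


def principal_axis_residuals(points):
    """Distance of each 2-D point from the best-fitting straight line."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    nx, ny = -math.sin(theta), math.cos(theta)  # unit normal to that axis
    return [abs((p[0] - mx) * nx + (p[1] - my) * ny) for p in points]


# A curved one-dimensional constraint: points on a semicircle.
semicircle = [(math.cos(math.pi * i / 50), math.sin(math.pi * i / 50))
              for i in range(51)]
# A linear one-dimensional constraint: points on a straight line.
straight = [(i / 50, 2 * i / 50) for i in range(51)]

print(max(principal_axis_residuals(semicircle)))  # large: the line misses the curve
print(max(principal_axis_residuals(straight)))    # essentially zero
```

Both data sets are one-dimensional in the sense used here, but only the straight one is described without loss by a single linear coordinate.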

Analysis of a dimensionality-reducing network after the completion of
training indicates that local regions of the hidden layer scalar-set map to
local regions of the constraint surface, in a topology-preserving fashion.
For example, a data sample drawn from the constraint surface at the location
shown in figure 13 would give rise to scalarized hidden layer activity
centered at a corresponding location in the hidden layer scalar-set.

Figure 13. Local regions of a hidden-layer scalar set represent local
regions of the embedded constraint surface, in a topology-preserving
fashion. For example, regions on a two-dimensional constraint surface
are represented by local patches of the hidden layer scalar-set.

The connectionist dimensionality-reducer described here bears modest
commonality with the method of Kohonen [1984]. Kohonen's method,
which is based on his theory of the topographic mappings achieved by
cells in the brain, also uses a large number of simple computing elements in
whose connections are contained knowledge of a constraint surface. However,
his method uses a very different type of self-organizing algorithm that
confounds the shape of the underlying constraint surface with the
probability distribution of data samples over that surface. The present method
differs from Kohonen's in the employment of the backpropagation scheme
for training multilayer networks, and in the explicit use of the landscape
smoothing parameter, w, to avoid local minima during training. The
representation of scalar values in sets of units by use of a smearing function is
similar to "value unit" coding described in [Ballard, 1986].
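As a reminder of what representing a scalar value in a set of units involves, here is a minimal sketch. The Gaussian form of the smearing function, the evenly spaced unit centers on [0, 1], and the centroid decoder are assumptions made for illustration; the names `encode_scalar` and `decode_scalar` are hypothetical and the paper's own smearing function may differ.

```python
import math


def encode_scalar(value, n_units=8, w=0.2):
    """Smear a scalar in [0, 1] across a set of units.

    Each unit has a preferred value; its activity falls off with distance
    from `value` according to an assumed Gaussian of width `w`.
    """
    centers = [i / (n_units - 1) for i in range(n_units)]
    return [math.exp(-((value - c) / w) ** 2) for c in centers]


def decode_scalar(activity):
    """Read a scalar back as the activity-weighted mean of unit centers."""
    n = len(activity)
    total = sum(activity)
    return sum(a * i / (n - 1) for i, a in enumerate(activity)) / total


pattern = encode_scalar(0.3)
print([round(a, 2) for a in pattern])  # unimodal, peaked near the third unit
print(decode_scalar(encode_scalar(0.5)))
```

A large w spreads activity over many units, which is the heavy smearing used early in training; a small w concentrates it on the few units nearest the encoded value. The centroid decoder is biased for values near the ends of the range, so it is only a rough readout.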

10  Conclusion 

We have presented a mechanism for performing dimensionality reduction
over data constrained to lie on a lower-dimensional surface embedded in
a high-dimensional data feature space. A technique is given for representing
in connectionist networks the scalar components of continuous
vector-valued data. An auxiliary error pressure is introduced in order to pressure
hidden units in the network into adopting this representation for scalar
values.

This method has been shown capable of capturing lower-dimensional
constraint surfaces which curve to a considerable degree. The network
constructed by this method organizes and maintains knowledge of constraint
latent in a set of data in order to encode information in a more concise
representation than its original description as a high-dimensional feature
vector. The connectionist dimensionality-reducer may also be viewed as a
means for performing pattern matching to a variable, parameterized, pattern
template given by the locus of points comprising the constraint surface
in the embedding feature space.